-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
TLB map setup per chip #179
Conversation
I will give this PR a more thorough review, but my initial reaction is that the relationship between TLB index and (x,y) location is something UMD should hide. UMD has to know about the TLB windows, there's no way around that. But the application does not. It shouldn't matter what TLB is mapped where as long as UMD/KMD manage these resources effectively. |
I agree, but bear in mind that this is not what this PR is about. This is not a final API, it is just pointing in the right direction, that TLB setup is chip specific. We might end up removing it altogether in the end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand the intent behind the change, but it expands an existing pattern that I regard as a core UMD design flaw: the pervasive use of hash tables for per-chip information lookup. This pattern - which exists in UMD independent of this change - undermines performance and has led to workarounds (get_fast_pcie_static_tlb_write_callable
and tt::Writer
).
As an interim change along a pathway to a better API, I have no objection.
I completely agree. But it is fine for me to have a more flexible API for slower access, and a performant option (such as tt::Writer, or as you suggested TLBWindow, which I ended up suggesting be AbstractIO in the end). On the other hand I do want to investigate maps that we use a bit. Theoretically, it shouldn't be expensive to use a hash map. It is known to have a bad implementation in stdlib though, there are much better options. I also suspect that just changing everything to map and set would improve perf (they should be more performant for very small containers, which are ours). Hash maps are not slow by design, and I'd very much like to get to the bottom of where the perf issue really is. |
Related tenstorrent/tt-metal#13949 got approved, so now merging this PR. |
40710a0
to
a4e6177
Compare
Currently tt_SiliconDevice accepts one map for everything. But that goes against our effort to separate cluster vs chip responsibilities, related to #157
This change includes:
Contributes to #154 since it adds more api tests.
This change will require tt_metal changes.